Abstract

Large language models have proven quite beneficial for a variety of automatic speech recognition tasks
at Google. We summarize results on Voice Search and a few YouTube speech transcription tasks to highlight the
impact that one can expect from increasing both the amount of training data and the size of the language model
estimated from such data. Depending on the task, the availability and amount of training data used, the language
model size, and the amount of work and care put into integrating them in the lattice rescoring step, we observe
reductions in word error rate between 6% and 10% relative, for systems on a wide range of operating points
between 17% and 52% word error rate.
I. Introduction

A statistical language model estimates the prior probability values $P(W)$ for strings of words $W$ in a vocabulary
$\mathcal{V}$ whose size is usually in the tens or hundreds of thousands. Typically the string $W$ is broken into
sentences, or other segments such as utterances in automatic speech recognition (ASR), which are assumed to be
conditionally independent. For the rest of this chapter, we will assume that $W$ is such a segment, or sentence.
With $W = w_1, w_2, \ldots, w_n$ we get

$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \qquad (1)$$

Since the parameter space of $P(w_k \mid w_1, w_2, \ldots, w_{k-1})$ is too large, the language model is forced to put the
context $W_{k-1} = w_1, w_2, \ldots, w_{k-1}$ into an equivalence class determined by a function $\Phi(W_{k-1})$. As a result,

$$P(W) \cong \prod_{k=1}^{n} P(w_k \mid \Phi(W_{k-1})) \qquad (2)$$

Research in language modeling consists of finding appropriate equivalence classifiers $\Phi$ and methods to estimate
$P(w_k \mid \Phi(W_{k-1}))$.

The most successful paradigm in language modeling uses the $(n-1)$-gram equivalence classification, that is,
defines

$$\Phi(W_{k-1}) = w_{k-n+1}, w_{k-n+2}, \ldots, w_{k-1}$$



Once the form $\Phi(W_{k-1})$ is specified, only the problem of estimating $P(w_k \mid \Phi(W_{k-1}))$ from training data
remains. In most practical cases, $n = 3$, which leads to a trigram language model.
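To make the equivalence classification concrete, the following is a minimal Python sketch, not the system described in this paper. It truncates each history to its last n-1 words and scores a sentence with the chain rule. The function names, the `<s>`/`</s>` boundary padding, and the unsmoothed maximum-likelihood estimate are all assumptions made for illustration; a practical model would use a smoothed estimator.

```python
import math
from collections import Counter


def ngram_context(history, n):
    """Equivalence class of a history: its last n-1 words (empty for n = 1)."""
    return tuple(history[-(n - 1):]) if n > 1 else ()


def train_counts(sentences, n):
    """Count (context, word) events and contexts over padded sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        # Assumption: pad with boundary symbols so every word has a full context.
        words = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for k in range(n - 1, len(words)):
            context = ngram_context(words[:k], n)
            ngram_counts[(context, words[k])] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def sentence_log_prob(sentence, ngram_counts, context_counts, n):
    """Chain-rule log-probability using unsmoothed relative frequencies."""
    words = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
    log_prob = 0.0
    for k in range(n - 1, len(words)):
        context = ngram_context(words[:k], n)
        count = ngram_counts[(context, words[k])]
        if count == 0:
            # Unseen event: the unsmoothed estimate assigns zero probability.
            return float("-inf")
        log_prob += math.log(count / context_counts[context])
    return log_prob


corpus = [["a", "b"], ["a", "c"]]
counts, contexts = train_counts(corpus, n=2)
print(math.exp(sentence_log_prob(["a", "b"], counts, contexts, n=2)))
```

The zero-probability case for unseen events is exactly why the estimation problem mentioned above matters: smoothing redistributes probability mass to n-grams absent from the training data.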

All authors are with Google, Inc., 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA.
A commonly used quality measure for a given model $M$ is related to the entropy of the underlying source and
was introduced under the name of perplexity (PPL) [1]:

$$PPL(M) = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \ln P_M(w_k \mid W_{k-1})\right) \qquad (3)$$

where $N$ is the number of words in the test data.
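As an illustration, perplexity in its conventional form is the exponential of the average negative log-probability that the model assigns to each word of a test text. The short Python sketch below assumes the per-word natural-log probabilities have already been computed by some model; the function name and the input format are hypothetical.

```python
import math


def perplexity(word_log_probs):
    """Exponential of the negative mean natural-log probability per word."""
    if not word_log_probs:
        raise ValueError("perplexity is undefined for an empty test set")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# A model that assigns probability 1/8 to every word behaves like a uniform
# choice among 8 words, so its perplexity is 8 (up to floating-point rounding).
print(perplexity([math.log(1 / 8)] * 5))
```

Read this way, perplexity is the effective branching factor of the model on the test data: lower values mean the model is less surprised by the text.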



A more relevant metric for ASR is the word error rate (WER) achieved when using a given language model in a
speech recognition system.
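For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below is a generic illustration under that definition, not the scoring tool used in these experiments; whitespace tokenization and the function name are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous_row[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.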

The distributed language model architecture described in [2] can be used for training and serving very large
language models. We have implemented lattice rescoring in this setup, and experimented with such large distributed
language models on various Google internal tasks.



II. Voice Search Experiments

We have trained query LMs in the following setup [3]:

• vocabulary size: 1M words, OOV rate 0.57%

• training data: 230B words, a random sample of anonymized queries from google.com that did not trigger
spelling correction.

The test set was gathered using an Android application. People were prompted to speak a set of random google.com
queries selected from a time period that does not overlap with the training data.

The work described in [4] and [5] enables us to evaluate relatively large query language models in the 1st pass
of our ASR decoder by representing the language model in the OpenFst [6] framework. Figures 1 and 2 show the PPL
and word error rate (WER) for two language models (3-gram and 5-gram, respectively) built on the 230B training
data, after entropy pruning to various sizes in the range of 15 million to 1.5 billion n-grams.

As can be seen, perplexity is very well correlated with WER, and the size of the language model has a significant
impact on speech recognition accuracy: increasing the model size by two orders of magnitude reduces the WER
by 10% relative.

We have also implemented lattice rescoring using the distributed language model architecture described in [2];
see the results presented in Table I.

This enables us to validate empirically the fact that rescoring lattices generated with a relatively small 1st pass
language model (in this case 15 million 3-gram, denoted 15M 3-gram in Table I) yields the same results as 1st pass
decoding with a large language model. A secondary benefit of the lattice rescoring setup is that one can evaluate
the ASR performance of much larger language models.
[Figure: perplexity (left axis) and word error rate (right axis) as a function of LM size; x-axis: LM size, # n-grams (B, log scale).]
Fig. 1: 3-gram language model perplexity and word error rate as a function of language model size; lower curve is PPL.
TABLE I: Speech recognition language model performance when used in the 1st pass or in the 2nd pass (lattice rescoring).



III. YouTube Experiments

YouTube data is extremely challenging for current ASR technology. As far as language modeling is concerned, 
the variety of topics and speaking styles makes a language model built from a web crawl a very attractive choice. 



A. 2011 YouTube Test Set 

A second batch of experiments was carried out in a different training and test setup, using more recent and also
more challenging YouTube speech data.
On the acoustic modeling side, the training data for the YouTube system consisted of approximately 1400 hours
of data from YouTube. The system used 9-frame MFCCs transformed by LDA, and SAT was performed.
Decision tree clustering was used to obtain 17552 triphone states, and STCs were used in the GMMs to model
the features. The acoustic models were further improved with bMMI [7]. During decoding, Constrained Maximum
Likelihood Linear Regression (CMLLR) and Maximum Likelihood Linear Regression (MLLR) transforms were
applied.

The training data used for language modeling consisted of Broadcast news acoustic transcriptions (approx. 1.6
million words), Broadcast news LM text distributed by LDC (approx. 128 million words), and a web crawl from
October 2008 (approx. 12 billion words). Each data source was used to train a separate interpolated Kneser-Ney
4-gram language model, of size 3.5 million, 112 million and 5.6 billion n-grams, respectively.

The first pass language model was obtained by interpolating the three components above, after pruning each of
them to 3-gram order and about 10M n-grams. Interpolation weights were estimated such that they maximized the
probability of a held-out set consisting of manual transcription of YouTube utterances.
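The paper does not say which procedure was used to find the interpolation weights. A common choice for maximizing held-out probability of a linear mixture is expectation-maximization (EM), sketched below in Python as an assumption rather than a description of the actual system. The function name and the input layout (one row per held-out word, one column per component model) are hypothetical, and the component probabilities are assumed to be strictly positive, as they would be for smoothed models.

```python
def estimate_interpolation_weights(component_probs, iterations=50):
    """EM for linear interpolation weights on held-out data.

    component_probs[t][m] is the probability that component model m assigns
    to the t-th held-out word given its history (assumed to be > 0).
    """
    num_models = len(component_probs[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        expected_counts = [0.0] * num_models
        for probs in component_probs:
            mixture = sum(w * p for w, p in zip(weights, probs))
            # E-step: posterior probability that each component produced the word.
            for m in range(num_models):
                expected_counts[m] += weights[m] * probs[m] / mixture
        # M-step: new weights are the normalized expected counts.
        weights = [count / len(component_probs) for count in expected_counts]
    return weights


# Toy held-out data: the third component is usually the best predictor.
held_out = [[0.01, 0.02, 0.20], [0.05, 0.01, 0.10], [0.02, 0.30, 0.25]]
print(estimate_interpolation_weights(held_out))
```

Each EM iteration cannot decrease the held-out likelihood, and the weights stay non-negative and sum to one by construction, which is why this procedure needs no explicit constraint handling.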

For lattice rescoring, the three language models were combined with the 1st pass acoustic model score and the
insertion penalty using MERT [8].
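The score combination that such tuning operates on can be sketched as a weighted sum of log-scores. The Python example below is a simplified illustration and not the implementation used here: it rescores an n-best list rather than a lattice, the weights are fixed by hand instead of being tuned by MERT, and the `Hypothesis` class, its fields, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    acoustic_log_prob: float
    lm_log_probs: List[float]  # one log-probability per language model


def combined_score(hyp, lm_weights, acoustic_weight, insertion_penalty):
    """Weighted sum of acoustic score, LM scores, and a per-word penalty."""
    lm_score = sum(w * s for w, s in zip(lm_weights, hyp.lm_log_probs))
    return (acoustic_weight * hyp.acoustic_log_prob
            + lm_score
            + insertion_penalty * len(hyp.words))


def rescore(hypotheses, lm_weights, acoustic_weight=1.0, insertion_penalty=0.0):
    """Return the hypothesis with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: combined_score(h, lm_weights, acoustic_weight, insertion_penalty),
    )


n_best = [
    Hypothesis(["a", "b"], -10.0, [-5.0, -6.0, -7.0]),
    Hypothesis(["a", "b", "c"], -9.0, [-8.0, -9.0, -10.0]),
]
# With the language models switched on, the first hypothesis wins; with all
# LM weights at zero, the acoustically better second hypothesis wins.
print(rescore(n_best, lm_weights=[1.0, 1.0, 1.0]).words)
print(rescore(n_best, lm_weights=[0.0, 0.0, 0.0]).words)
```

A tuning procedure would search over `lm_weights`, `acoustic_weight`, and `insertion_penalty` to minimize the error rate of the selected hypotheses on development data.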

The test set consisted of 10 hours of randomly selected YouTube speech data. 

Table II presents the results in various rescoring configurations:

• 2nd, MERT uses lattice MERT to compute the optimal weights for mixing the three language model scores,
along with acoustic model score and insertion penalty. It achieves a 3.2% absolute reduction in WER. Despite
the very high error rate of the baseline, this amounts to a 6% relative reduction in WER.

• 2nd, unif uses uniform weights across the three language models, quantifying the gain that can be attributed 
to MERT (0.6% absolute). 

• 2nd, no www throws away the www LM from the mix to evaluate its contribution: 1.2% absolute reduction 
in WER. 
TABLE II: YouTube 2011 test set: Lattice rescoring using a large language model trained on web crawl.



Experiments on a development set collected at the same time as the test set insert the large LM rescoring at
various stages in the rescoring pipeline, using increasingly powerful acoustic models, as reported in [9]. The results
are reported in Table III.



Pass            Acoustic Model     Language Model   Size    WER (%)
Baseline        baseline AM        14M 3-gram               52.8
2nd             baseline AM        5.6B 4-gram      LARGE   49.4
better AM       DBN + tuning       14M 3-gram               49.4
2nd             DBN + tuning       5.6B 4-gram      LARGE   45.4
even better AM  MMI DBN + tuning   14M 3-gram               48.8
2nd             MMI DBN + tuning   5.6B 4-gram      LARGE   45.2

TABLE III: YouTube 2011 dev set: Lattice rescoring using a large language model trained on web crawl. Lattices are generated
with increasingly powerful acoustic models.



We observe consistent gains between 6% and 9% relative (3.4-4.0% absolute) at the various WER operating points
obtained with the more powerful acoustic models. As a side note, the gains from large LM rescoring are comparable
to those obtained by using deep-belief NN (DBN) acoustic models.
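The absolute and relative reductions can be checked directly against the WER values listed in Table III. The small Python sketch below does that arithmetic; the function name and the data layout are illustrative choices, while the WER figures are the ones reported in the table.

```python
def wer_reduction(baseline_wer, rescored_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - rescored_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# (acoustic model, WER before large LM rescoring, WER after), from Table III.
table_iii = [
    ("baseline AM", 52.8, 49.4),
    ("DBN + tuning", 49.4, 45.4),
    ("MMI DBN + tuning", 48.8, 45.2),
]

for acoustic_model, first_pass_wer, rescored_wer in table_iii:
    absolute, relative = wer_reduction(first_pass_wer, rescored_wer)
    print(f"{acoustic_model}: {absolute:.1f} absolute, {relative:.1f}% relative")
```

This yields reductions of 3.4, 4.0, and 3.6 points absolute, or about 6.4%, 8.1%, and 7.4% relative, for the three acoustic models. The relative figure is the one that stays comparable across systems with different baseline error rates.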
B. 2008 YouTube Test Set

In a different batch of YouTube experiments, Thadani et al. [10] train a language model on a web crawl from
2010, filtered to retain only documents in English.

The training data used for language modeling consisted of Broadcast news acoustic transcriptions (approx. 1.6
million words), Broadcast news LM text distributed by LDC (approx. 128 million words), and a web crawl from
2010 (approx. 59 billion words). Each data source was used to train a separate interpolated Kneser-Ney 4-gram
language model, of size 3.5 million, 112 million and 19 billion n-grams, respectively.

The first pass language model was obtained by interpolating the three components above, after pruning each of
them to 3-gram order and about 10M n-grams.

For lattice rescoring, the three unpruned language models were combined using linear interpolation. 

For both first-pass and rescoring language models, interpolation weights were estimated such that they maximized
the probability of a held-out set consisting of manual transcription of YouTube utterances.

The test corpus consisted of 77 videos containing news broadcast style material downloaded in 2008 [11]. They
were automatically segmented into short utterances based on pauses between speech. The audio was transcribed at
high quality by humans trained in the task.

Table IV highlights the large LM rescoring results presented in [10].
TABLE IV: YouTube 2008 test set: Lattice rescoring using a large language model trained on web crawl.

The large language model used for lattice rescoring decreased the WER by 2.8% absolute, or 8% relative, a 
significant improvement in accuracy. 

IV. Conclusions 

Large n-gram language models are a simple yet very effective way of improving the performance of real-world
ASR systems. Depending on the task, the availability and amount of training data used, the language model size,
and the amount of work and care put into integrating them in the lattice rescoring step, we observe improvements
in WER between 6% and 10% relative.



¹ Unlike the Voice Search experiments reported in Table I, no interpolation between the first and the second pass
language model was performed. In our experience, such interpolation consistently yields small gains in accuracy.